Multithreaded build #495
src/annoylib.h (outdated):

```cpp
for (int i = 0; i < n_threads; i++) {
  int trees_per_thread = -1;
  if (q > -1) {
    // First thread picks up the remainder of the work
```
This might lead to an uneven split, e.g. if q=20 and n_threads=7, then the distribution will be [8, 2, 2, 2, 2, 2, 2]. I think you can do `trees_per_thread = work_per_thread + (i < work_remainder)` and it should split it like [3, 3, 3, 3, 3, 3, 2].
Actually, even simpler: you can just do `trees_per_thread = (q + i) / n_threads` and you don't even have to define `work_per_thread`.
Awesome, I was not exactly happy with my solution, so I'll reuse this.
src/annoylib.h (outdated):

```diff
@@ -1172,7 +1206,87 @@ template<typename S, typename T, typename Distance, typename Random>
   return get_node_ptr<S, Node>(_nodes, _s, i);
 }

-  S _make_tree(const vector<S >& indices, bool is_root) {
+  void _thread_build(int q, int thread_idx, ThreadBarrier& barrier, std::mutex& _nodes_mutex) {
```
I think it would be good to have the same code for threaded and non-threaded building.
I'm not exactly sure what you mean by this: should we still support building without threads (e.g. when n_threads=1)?
I'd actually be OK with only supporting the threaded build, so that we can remove the non-threaded build code.
I'm probably missing something, but `_thread_build` is just a private helper function that was extracted from the previous `build` function. It's there just so I don't have to stick everything inside a lambda when spawning a new thread. Could you point out which part of the code could potentially be removed? 😄
I think `_thread_build` could be a pretty simple wrapper around `_make_tree`. I think the only two places that touch any state are:
- https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1185
- https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1252

I don't think you need the magic where you go into nodes and add offsets etc. So all `_thread_build` would have to do is keep adding trees until you reach the limit.
Yeah exactly, that's what I was thinking. Your approach at https://github.com/spotify/annoy/pull/495/files#diff-91eaadc9d0aa248375a22a64f315e3d8R1250 is very clever, but I'm a bit nervous it's too hacky and I'd love to have something slightly simpler :)

Doesn't your barrier thing roughly accomplish what a shared_mutex does? Or it should at least be possible to modify it into a sort of shared mutex? Haven't thought about it enough.
I agree, I would definitely feel more comfortable if I could simplify it a bit 😄

Unfortunately, a barrier does not accomplish what a shared_mutex is supposed to do: it can only be used as a global synchronization point, not for reader/writer-style locking. After reading a couple of SO threads, a custom shared mutex could prove really tricky to implement correctly. We could use `std::shared_timed_mutex`, which has a few additional capabilities and is available from C++14. Either way, I'll implement the less-magic version we discussed, and then we can compare the performance and decide whether it's worth bumping the C++ version.
Well, I managed to simplify my solution in the latest commit 😄 I removed all of the magic and introduced a couple more `mutex` variables and a `shared_timed_mutex` variable.

I've re-run my time measurements on glove-angular-25:

| Num. threads | n_trees=32 time (s) | n_trees=64 time (s) |
|---|---|---|
| 1 | 49.29 | 99.89 |
| 2 | 26.54 | 53.29 |
| 4 | 15.03 | 29.34 |
| 8 | 10.42 | 18.97 |
| 16 | 9.76 | 17.07 |
Wow, amazing! Any idea whether there's any slowdown compared to the current master?

The new code looks much cleaner and less "clever". I don't know how well supported C++14 is, but hopefully it's not a big problem.

Do you really need so many different mutexes? Not a concern, I was just surprised to see four different ones.
I can prepare a more comprehensive timing benchmark and include a comparison with the current master. But as far as I can tell, there isn't a significant slowdown.

Regarding the number of mutexes: if I guard everything with the shared `nodes_mutex`, just one thread has to acquire an exclusive lock and everything grinds to a halt. So I opted for a more granular approach, where I exclusively lock `nodes_mutex` only when absolutely necessary, and add separate regular locks for all other shared variables (`nodes_size`, `_n_nodes`). I'll see if I can cut down on the number of mutexes, but I currently can't see a better way.
src/annoylib.h (outdated):

```cpp
std::mutex _nodes_mutex;
ThreadBarrier barrier(n_threads);
vector<std::thread> threads(n_threads);
int work_per_thread = (int)floor(q / (double)n_threads);
```
What happens if q = -1? You probably have to determine q before this code in case it's set to -1.

This is super cool! I left a few minor comments, but generally this is great.
src/annoylib.h (outdated):

```diff
@@ -929,7 +952,7 @@ template<typename S, typename T, typename Distance, typename Random>
   return true;
 }

-  bool build(int q, char** error=NULL) {
+  bool build(int q, int n_threads=1, char** error=NULL) {
```
Might be good to default it to -1 and, if that's the case, set it to the number of CPUs.
```cpp
vector<vector<Node*> > thread_trees;
vector<S> thread_roots;
while (1) {
  if (q == -1) {
```
This code to determine q if q = -1 could probably just be left in the `build` function.
The thing with q = -1 is that I don't know beforehand how many trees each thread should build, because it depends on the number of nodes. That is why I have to check for -1 everywhere. A better solution would be to give q = -1 a different meaning, where each thread would build just one tree, for example.
Ah yeah, good point, I forgot about that. OK, if your code works then let's keep it, I guess.

@erikbern Could you take a look at the failing Go and Lua builds, if you're able to help? I'm not sure how to get the -std=c++11 flag into the compilation. Windows Python 2.7 might be tricky, because it seems it doesn't support C++11 thread/mutex at all.
src/annoylib.h (outdated):

```cpp
}

// Wait for all threads to finish before we can start inserting tree nodes into global _nodes array
barrier.wait();
```
Isn't locking the mutex enough to enforce this?
Unfortunately not, due to the reasons noted in #495 (comment): we have to wait for all threads to finish their work before we can mutate the global `_nodes` buffer.
src/annoymodule.cc (outdated):

```diff
 if (!self->ptr)
   return NULL;
-static char const * kwlist[] = {"n_trees", NULL};
-if (!PyArg_ParseTupleAndKeywords(args, kwargs, "i", (char**)kwlist, &q))
+static char const * kwlist[] = {"n_trees", "n_threads", NULL};
```
Just an aside: what do you think about naming this parameter `n_jobs` instead of `n_threads` in the Python wrapper? `n_jobs` seems more consistent with the Python ecosystem. Either way, I think this is a great improvement!
Sure, that makes sense. I can expose an `n_jobs` keyword argument for Python, but keep `n_threads` everywhere else.
…h uses all CPU cores. Rename n_threads to n_jobs in the Python interface.
This looks quite amazing so far. Would be good to figure out the CI failures.

As far as I can tell, Windows Py2.7 does not support the C++ thread/mutex libraries in any way. For Go and Lua I need to add the -std=c++14 flag to the compilation step, but I'm not really sure how to do that.

Maybe there's some dumb way so that if those headers aren't present, it just defaults to single-threaded? Not sure if that's a lot of work. I'd imagine there would be a lot of

FYI, the tests timed out on OSX, which implies there might be some race condition or similar? https://travis-ci.org/github/spotify/annoy/jobs/713675853
…aded policies. Added ANNOYLIB_MULTITHREADED_BUILD which controls which threading policy to use.
Nice, the last change is 👌
Feel free to take another look and see if anything can be improved. OSX is going wonky; I'm still investigating the issue.

I think I figured out the OSX issue. Apparently locking/unlocking is much more expensive under OSX, and I'm doing it excessively in a loop when adding indices. Moving the lock around the loop should do the trick 🤞
…because OSX should be faster now.
This looks great – I'm happy to merge whenever you give 👍

Awesome 👍 I'll just prepare a short timing comparison with master sometime tomorrow. If there aren't any more surprises, it'll be ready to go.
Here is a multithreaded build timing comparison for the glove-angular-25 dataset:

Here are single-threaded timings from master for comparison:

The speed-up is not linear, but it's good enough, especially for a large number of trees. I think it's ready to 🚢
Awesome, I'll merge this now. I won't publish this to PyPI for a few weeks, just in case there are some issues. There are some people using master, so that's effectively a form of "beta testing". Really appreciate the work!

Awesome, sounds good! 😄
Hey there, we are already using the multithreaded build with Python and love the speed improvements! Now we are even thinking about migrating to Go in order to have better performance when loading the data before inserting it into the tree.

Is there a specific issue that blocked this from getting implemented? Do you plan to follow up with a PR that enables multithreaded builds, or should I try sending a PR?
Hey @n1ru4l, no, I wasn't planning on it. You can give it a go if you would like :) As I've mentioned, you would only need to supply the -std=c++14 flag when compiling for Go, and possibly update the Go version in the tests (https://github.com/spotify/annoy/blob/master/tox.ini#L13).

@novoselrok That sounds easy in theory :D I will give it a try!

So we decided against using Go. I won't be implementing this any time soon then.
I'm running this in Docker on Ubuntu, but I'm only seeing one CPU core used at a time (100%); the rest are idle (0%). When I'm training a TensorFlow model in the same container, it uses 100% of the CPUs, so I don't think the Docker configuration is the issue.

I am facing the same issue as @utkarshgupta137, did you find any solution for this?

It has been a while, but I don't think I did.

I see. That's unfortunate. I have created an issue here: #627. Hopefully the author will respond.
Issue: #70

This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04), all C++, Python, and Lua tests seem to pass (I've re-run them multiple times). The only thing remaining is fixing the Go build, and I would appreciate some help with this 😄

I've run some speed tests on the glove-angular-25 index with n_trees=32. Reference time on master (without threads) was 49.9s. Here are the results for multiple threads:

If you have any questions or concerns, feel free to comment.